OBIS is small enough to analyse by itself on a decent laptop computer or a small cluster. We have previously visualized ES50 for OBIS data by decade.
But GBIF has an order of magnitude more records than OBIS, so combining the two takes the analysis to another level. Options for tackling this volume of data include increasing the memory in your cluster or splitting the analysis into more manageable chunks. Here we demonstrate the latter.
I had to make the analysis work on one of three resources:
- My government laptop
- An AWS cloud instance with 8 CPUs and 32 GB of RAM
- The USGS HPC Denali
I ended up using the last option. The other two could handle smaller subsets of the data, like the continental USA, but not all quadrants of the globe for the full 2010s.
The data are ~185 GB, representing 2.4 billion records and XX columns. The parquet file format helps, but the diversity calculations still choke on the full dataset because, as the dask documentation points out, sometimes you just need to chunk up your data.
I tried using dask and Spark in addition to R arrow, but stopped putting effort into those methods because I was less comfortable with Python and Java than with R, and it became apparent that, no matter what, I would have to process the data in chunks to compute the global data set on the resources I had access to. The R package sparklyr required installation of a Spark environment, something I couldn't do without IT intervention, so I didn't pursue it. Similarly, while it was easy to spin up a dask cluster on AWS, I couldn't easily do that on our HPC, so I didn't put effort into pursuing it. But going forward, both options could probably accomplish the same thing as this R workflow.
We created maps with both continuous and binned color schemes. With the continuous scheme, we found it difficult to differentiate the middle values. Our conclusion was that precise values were less useful than being able to quickly see where biodiversity was extremely low or high, and what the approximate “normal” was for various regions.
| Continuous Color | Discrete Color |
|---|---|
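The binning behind the discrete-color maps is just base R's `cut()`; a toy sketch (the values here are made up for illustration, and the real maps bin ES values inside `gmap_discrete` later in this post):

```r
# Bin a continuous diversity metric into 5 equal-width classes,
# as the discrete-color maps do (toy values for illustration)
es_values <- c(3, 12, 27, 41, 48)
binned <- cut(es_values, breaks = seq(0, 50, 10))
levels(binned)  # "(0,10]" "(10,20]" "(20,30]" "(30,40]" "(40,50]"
```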
See my explanation of ESi for background on the metric.
We examined the relationship between ESi, richness, Shannon diversity, and the Hill number of order 1. Ultimately, it was clear that there was a strong correlation between Shannon diversity and ESi for this set of data. In the image below, the panels represent different h3 resolutions. There didn’t seem to be a strong influence of resolution on the relationship or on either metric.
This is probably because:
However, I think it’s arguable that the output of ESi is more intuitive to humans than Shannon diversity, while avoiding the extreme sampling-bias effects associated with richness.
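For intuition, the Hurlbert/ES index can be computed directly from per-species record counts. A minimal base-R sketch (`es_index` is a hypothetical helper, not part of obisindicators; it uses `lchoose` rather than the gsl calls used later in this post):

```r
# Hurlbert's ES index: the expected number of species found in a random
# subsample of `esn` records, given per-species record counts `ni`
es_index <- function(ni, esn = 50) {
  n <- sum(ni)
  if (n < esn) return(NA_real_)  # too few records to subsample
  # P(species i missed by the subsample) = choose(n - ni, esn) / choose(n, esn)
  sum(1 - exp(lchoose(n - ni, esn) - lchoose(n, esn)))
}

es_index(rep(1, 100), esn = 50)         # 100 singleton species -> exactly 50
es_index(c(951, rep(1, 49)), esn = 50)  # one dominant species -> much lower
```

This shows why ES is less sensitive to sampling effort than raw richness: both toy communities have records for 50+ species, but the dominated one scores far lower.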
We weren’t sure what the best sampling depth was for ESi for these global datasets. We wanted something that could resolve real differences in diversity while discarding as few cells as possible for lack of data. Some past studies had looked at ES50 and ES100, so we did the same. We found that ES100 worked really well for areas like the United States where there is an abundance of data. ES50 didn’t provide as much nuance, but could provide values for more cells.
| ES50 | ES100 |
|---|---|
After quality filtering (we still haven’t addressed “issue” flags), we had data in 533,564 of 2,016,842 cells, or about 1/4 of the cells on the globe. Of these, we could calculate a non-NA value for about 30% when using ES50 (so about 8% of the total cells on the globe) vs 23% with ES100 (about 6% of the total cells on the globe).
We struggled with consistently implementing the dggridR system. Too many collaborators struggled with installation for it to seem like a good choice with regard to reproducibility. The h3 system is a nice grid solution for a variety of reasons, and we were intrigued by its nesting properties, which have proven useful for exploring this data and our analysis directions.
However, hexagons are more computationally costly than rectangles and represent a hurdle in generating and sharing the data. Furthermore, the hexagons in the h3 system vary in area relative to the icosahedron vertices, so they are not as statistically rigorous as the dggridR system.
If we use a coarser resolution, then we can reduce the number of NA values with ESi. This is very pleasing to look at, but it paints some polities in too-broad strokes, in addition to giving the illusion that our data cover the earth much more completely than is true. For example, at a resolution of 3, 20% of the countries in the world are represented by ~1 hexagon or less. We chose to work with an h3 resolution of 5, at which each hexagon averages ~253 km^2. We settled on this after examining some sample data at resolutions 2 - 7. We felt a resolution of 5 balanced our desire to show high-resolution diversity metrics against discarding too much data that was too sparse to meet the requirements of the Hurlbert (AKA ES) index. For perspective, at this resolution the Great Barrier Reef contains about 1,375 hexagons and the island country of Barbados would be covered by about 2 hexagons.
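A quick back-of-the-envelope check on those counts, using the published mean h3 cell area at resolution 5 (~252.9 km^2; the land/reef areas below are approximate figures, not from this analysis):

```r
avg_hex_area_res5 <- 252.9   # km^2, mean h3 cell area at resolution 5
barbados_area     <- 430     # km^2, approximate land area of Barbados
gbr_area          <- 344400  # km^2, approximate area of the Great Barrier Reef Marine Park

ceiling(barbados_area / avg_hex_area_res5)  # ~2 hexagons
round(gbr_area / avg_hex_area_res5)         # about 1362 -- close to the ~1375 quoted above
```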
We’re going to visualize our biodiversity values as h3 hexagons and by decade, since these are two useful units for aggregating the data into a meaningful amount of time and space. So our strategy will be to:

1. Download the OBIS and GBIF snapshots as parquet files.
2. Filter the records, index them to h3 cells and globe quadrants, and repartition the parquet files by year and quadrant.
3. Loop over the quadrants to calculate the diversity indices per cell.
4. Join the indices to a global h3 grid and map them.
It is much faster to work with these large parquet files locally than to query an S3 bucket via arrow or another package. Every month, GBIF publishes a snapshot of their database via AWS. The OBIS snapshot is published every XXXX and can be found at https://obis.org/data/access/#. Download these parquet files to your local machine as the raw data for this analysis.
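For the GBIF side, one way to pull a snapshot down is the AWS CLI against GBIF's open-data bucket. The bucket name and date-stamped prefix below are assumptions based on the local paths used later in this post; check GBIF's AWS open-data page for the current layout:

```shell
# Sync one monthly GBIF snapshot (public data; no AWS account needed).
# Bucket/prefix are assumed -- verify against GBIF's open-data documentation.
aws s3 sync --no-sign-request \
  s3://gbif-open-data-us-east-1/occurrence/2023-01-01/occurrence.parquet/ \
  data/raw/gbif/snapshot/2023-01-01/occurrence.parquet/
```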
In addition to removing unneeded columns, we’ll only work with the most recent decade of data for this vignette. A few caveats about the code:
We’re doing some filtering based on recommendations from other GBIF/OBIS users (linked below). That being said, I haven’t yet dealt with the “issues” column, as recommended, because the array format was difficult to deal with before the “collect” function. Eventually, issues will be filtered out too.
As fantastic as the R arrow package is, it doesn’t yet implement the full set of dplyr verbs. For example, %in% and != aren’t yet supported inside filter() on arrow datasets. So you’ll see a few places where the code is more verbose than it typically needs to be for filtering the data.
#Downloaded most recent snapshot of OBIS and uploaded to HPC via GLOBUS
#Repartition by year and hemisphere (AKA quadrant) after indexing to h3
library(dplyr)
library(arrow)
setwd("/caldera/projects/css/sas/bio/spec_obs/global_biodiversity_map/")
#generate h3 indices by hemisphere
source("src/R/HPC/h3_index_by_hemisphere.R")
#Reduce parquet to columns that are needed for analysis
df <- open_dataset(sources = "data/raw/obis/snapshot/obis_20221006.parquet")
#select columns
desired_cols <- c("id",
"decimalLongitude",
"decimalLatitude",
"eventDate",
"date_year",
"countryCode",
"basisOfRecord",
"occurrenceStatus",
"flags",
"dropped",
"kingdom",
"class",
"species")
tictoc::tic()
occ <- df %>%
select(all_of(desired_cols)) %>%
filter(date_year >= 2010) %>%
filter(date_year <= 2019) %>%
filter(occurrenceStatus == "Present" | occurrenceStatus == "present") %>% #drops 3e5 records unless you account for inconsistent capitalization.
filter(basisOfRecord == "Observation" |
basisOfRecord == "MachineObservation" |
basisOfRecord == "HumanObservation" |
basisOfRecord == "MaterialSample" |
basisOfRecord == "PreservedSpecimen" |
basisOfRecord == "Occurrence") %>%
filter(!is.na(species)) %>%
filter(!is.na(decimalLatitude)) %>%
select(-basisOfRecord,
-occurrenceStatus,
-countryCode) %>%
collect()
tictoc::toc() #took 3 seconds
#From a quick look, OBIS will add about 3.3e5 records to the CONUS 2010s analysis. But this will be reduced after merging with GBIF and duplicates are removed.
#Rename columns to match OBIS convention and write back out to parquet
occ <- occ %>%
rename(obisid = id,
decimallongitude = decimalLongitude,
decimallatitude = decimalLatitude,
year = date_year,
eventdate = eventDate)
Now we need to make the h3 and quadrant indices so we can apply them to our occurrence records. First index the h3 cells to quadrants…
#h3 can't polyfill beyond 180 degree arcs
library(dplyr)
CRS <- sf::st_crs(4326)
NW1 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 0,
xmax = -90,
ymin = 0,
ymax = 90
), crs = CRS
)))
NW2 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = -90,
xmax = -180,
ymin = 0,
ymax = 90
), crs = CRS
)))
NW_hex_ids <- c(h3::polyfill(NW1, res = 5),
h3::polyfill(NW2, res = 5))
NE1 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 0,
xmax = 90,
ymin = 0,
ymax = 90
), crs = CRS
)))
NE2 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 90,
xmax = 180,
ymin = 0,
ymax = 90
), crs = CRS
)))
NE_hex_ids <- c(h3::polyfill(NE1, res = 5),
h3::polyfill(NE2, res = 5))
SW1 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 0,
xmax = -90,
ymin = 0,
ymax = -90
), crs = CRS
)))
SW2 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = -90,
xmax = -180,
ymin = 0,
ymax = -90
), crs = CRS
)))
SW_hex_ids <- c(h3::polyfill(SW1, res = 5),
h3::polyfill(SW2, res = 5))
SE1 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 0,
xmax = 90,
ymin = 0,
ymax = -90
), crs = CRS
)))
SE2 <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 90,
xmax = 180,
ymin = 0,
ymax = -90
), crs = CRS
)))
SE_hex_ids <- c(h3::polyfill(SE1, res = 5),
h3::polyfill(SE2, res = 5))
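The cell-membership approach above guarantees that every h3 cell lands in exactly one quadrant. For intuition, the equivalent point-based rule is just the signs of latitude and longitude; `quadrant` below is a hypothetical helper, not used in the workflow, since assigning by cell ID keeps all records of a boundary-straddling hexagon together:

```r
# Point-based quadrant assignment, for illustration only: the real
# workflow assigns quadrants by h3 cell membership so that records in
# hexagons straddling the equator/prime meridian stay together
quadrant <- function(lat, lon) {
  paste0(ifelse(lat >= 0, "N", "S"),
         ifelse(lon >= 0, "E", "W"))
}

quadrant(c(45, -12), c(-100, 150))  # "NW" "SE"
```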
…then index the OBIS records to both.
#Add h3 index, hemisphere and rewrite as parquet
tictoc::tic()
occ$cell <- occ %>%
select(decimallatitude,
decimallongitude) %>%
h3::geo_to_h3(latlng = .,
res = 5)
occ$hemisphere <- case_when(occ$cell %in% NW_hex_ids ~ "NW",
occ$cell %in% NE_hex_ids ~ "NE",
occ$cell %in% SW_hex_ids ~ "SW",
occ$cell %in% SE_hex_ids ~ "SE")
tictoc::toc() #took 30 seconds
Write the object to parquet files, optimizing future analysis by partitioning by year and quadrant.
tictoc::tic()
write_dataset(dataset = occ,
format = "parquet",
path = "data/processed/OBIS_2010s/",
partitioning = c("year","hemisphere"),
existing_data_behavior = "overwrite")
tictoc::toc() #took 16 seconds
This is very similar to the OBIS cleaning above; mostly the differences are in the capitalization of column names (why?). Since we have already created the quadrant indices, we don’t need to do that again. One big difference is that the GBIF data is big enough that the resources we had on hand weren’t sufficient to process 2010 - 2019 in one go, so instead we loop through each partition. This took about 1-3 seconds per partition, about 50 minutes total.
#Downloaded most recent snapshot of GBIF via GLOBUS.
# Repartition by year and hemisphere after indexing to h3
library(dplyr)
library(arrow)
setwd("/caldera/projects/css/sas/bio/spec_obs/global_biodiversity_map/")
#generate h3 indices by hemisphere
source("src/R/HPC/h3_index_by_hemisphere.R")
#list all partitions
partitions <- list.files(path = "data/raw/gbif/snapshot/2023-01-01/occurrence.parquet/", full.names = TRUE)
tictoc::tic()
lapply(partitions, function(x) {
print(x) #acts as a quasi-progress bar
tictoc::tic()
occ <- arrow::open_dataset(sources = x)
# %in% and != aren't implemented yet in filter arrow
occ <- occ %>%
select(
gbifid,
#issue, #will be incorporated in the future, array is irritating to parse with R tools
basisofrecord,
occurrencestatus,
countrycode,
year,
decimallongitude,
decimallatitude,
species,
kingdom,
class,
eventdate
) %>%
filter(year >= 2010) %>%
filter(year <= 2019) %>%
filter(occurrencestatus == "PRESENT") %>%
filter(basisofrecord == "OBSERVATION" |
basisofrecord == "MACHINE_OBSERVATION" |
basisofrecord == "HUMAN_OBSERVATION" |
basisofrecord == "MATERIAL_SAMPLE" |
basisofrecord == "PRESERVED_SPECIMEN" |
basisofrecord == "OCCURRENCE"
) %>%
filter(!is.na(species)) %>%
filter(!is.na(decimallatitude)) %>%
select(-basisofrecord,-occurrencestatus,-countrycode) %>%
collect()
#Add h3 index, hemisphere and rewrite as parquet
occ$cell <- occ %>%
select(decimallatitude,
decimallongitude) %>%
h3::geo_to_h3(latlng = .,
res = 5)
occ$hemisphere <- case_when(occ$cell %in% NW_hex_ids ~ "NW",
occ$cell %in% NE_hex_ids ~ "NE",
occ$cell %in% SW_hex_ids ~ "SW",
occ$cell %in% SE_hex_ids ~ "SE")
#write to parquet
partition_name <- basename(x)
write_dataset(dataset = occ,
path = paste0("data/processed/GBIF_2010s/",
partition_name),
partitioning = c("year", "hemisphere"),
existing_data_behavior = "overwrite")
tictoc::toc()
})
time_total <- tictoc::toc() #took about 48 min on USGS HPC Denali. Each partition took 1-3 seconds.
We will use a modified version of the obisindicators function, obis_calc_indicators, to calculate several indices at the same time. Run ?obisindicators::calc_indicators to learn more about the input and output structure.
I found that the function currently in the obisindicators package is a little slow, I think because it depends on a lot of tidyverse functions. While more readable, it can be sped up by several orders of magnitude if it is rewritten in data.table like so:
#obisindicators::calc_indicators() rewritten in data.table for speed
obis_calc_indicators_improved_dt <- function (df, esn = 50)
{
stopifnot(is.data.frame(df))
stopifnot(is.numeric(esn))
stopifnot(all(c("cell", "species", "records") %in% names(df)))
dt <- data.table::as.data.table(df)
dt <- dt[, .(ni = sum(records)),
by = .(cell, species)]
dt[, `:=`(n = sum(ni)),
by = cell]
dt[, `:=`(hi = -(ni / n * log(ni / n)),
si = (ni / n) ^ 2,
qi = ni / n,
esi = ifelse(n - ni >= esn,
1 - exp(gsl::lngamma(n - ni + 1) + gsl::lngamma(n - esn + 1) - gsl::lngamma(n - ni - esn + 1) - gsl::lngamma(n + 1)),
ifelse(n >= esn,
1,
NA))),
by = cell]
dt <- dt[, .(n = sum(ni),
sp = .N,
shannon = sum(hi),
simpson = sum(si),
maxp = max(qi),
es = sum(esi)), by = .(cell)]
dt[, `:=`(hill_1 = exp(shannon),
hill_2 = 1 / simpson,
hill_inf = 1 / maxp),
by = cell]
return(dt)
}
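The lngamma expression in the esi calculation above is the log-space form of the binomial ratio choose(n - ni, esn) / choose(n, esn), i.e. the probability that a species is missed by the subsample, computed with log-gamma to avoid overflow. Base R's lgamma gives the same values as gsl::lngamma for positive arguments, so we can sanity-check the identity on a case small enough to compute directly:

```r
# Verify that the log-gamma form used in the data.table function
# equals the direct binomial ratio for a case that doesn't overflow
n <- 1000; ni <- 40; esn <- 50
log_form <- exp(lgamma(n - ni + 1) + lgamma(n - esn + 1) -
                lgamma(n - ni - esn + 1) - lgamma(n + 1))
direct   <- choose(n - ni, esn) / choose(n, esn)
all.equal(log_form, direct)  # TRUE
```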
We loop through the quadrants of the globe, aggregated as a decade. This took about 15-20 minutes for me.
#Join obis and gbif and make map
tictoc::tic()
setwd("/caldera/projects/css/sas/bio/spec_obs/global_biodiversity_map/")
library(arrow)
library(dplyr)
library(sf)
library(ggplot2)
source("src/R/HPC/gmap_discrete.R")
source("src/R/HPC/obis_calc_indicators_improved_dt.R")
#set ES sampling depth
es_depth <- 50
#set kingdoms to include
kings <- "Plantae"
#load data
obis <- open_dataset(sources = "data/processed/OBIS_2010s/")
gbif <- open_dataset(sources = "data/processed/GBIF_2010s/")
# Split into hemispheres ------
hemispheres <- c("NE", "NW", "SE", "SW") #need to investigate what to do with NA values from hemisphere
#x <- hemispheres[2] #for testing. Looks like NE + NW can't be handled on my computer, but either one individually works fine.
# On Denali, NE hemisphere took about 4 min
tictoc::tic()
bio_idx <- lapply(hemispheres, function(x) {
tictoc::tic()
obis_comp <- obis %>%
select(#year, not needed save memory
species,
cell,
hemisphere,
kingdom) %>%
filter(hemisphere == x) %>%
select(-hemisphere) %>%
filter(kingdom==kings) %>%
collect()
gbif_comp <- gbif %>%
select(#year, not needed, save memory
species,
cell,
hemisphere,
kingdom) %>%
filter(hemisphere == x) %>%
select(-hemisphere) %>%
filter(kingdom==kings) %>%
collect()
done <- list(obis_comp, gbif_comp)
done <- data.table::rbindlist(done)
done <- done[,
             .(records = .N),
             by = .(cell,
                    #year,
                    species)]
#Calculate diversity indices
bio_idx <-
done %>% obis_calc_indicators_improved_dt(esn = es_depth) %>%
select(cell, es) #save memory
tictoc::toc()
return(bio_idx)
}) %>%
data.table::rbindlist()
tictoc::toc()
#For 2010s took about 13 minutes. NE = 2.5 min, NW = 7 min, SE = 2 min, SW = 1.5 min. Table size == 533,546 x 2
The global h3 grid has to be generated for the resolution you did the indexing at, in our case resolution == 5. You can read more about this here.
One important thing to keep in mind is that you only need to generate this grid once. Then you can stash it as an rds object for any subsequent runs. Each increase in resolution corresponds to about an order of magnitude increase in the number of cells. I noticed a real slowdown above 5. Using future and furrr to parallelize it helped some.
make_h3_grids_parallel <- function (hex_res = 2){
CRS <- sf::st_crs(4326)
east <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = 0,
xmax = 180,
ymin = -90,
ymax = 90
), crs = CRS
)))
west <- sf::st_sf(geom = sf::st_as_sfc(sf::st_bbox(
c(
xmin = -180,
xmax = 0,
ymin = -90,
ymax = 90
), crs = CRS
)))
hex_ids <-
c(h3::polyfill(east, res = hex_res),
h3::polyfill(west,
res = hex_res))
dl_offset <- 60
future::plan(strategy = "multisession",
workers = max(1, floor(future::availableCores() * 0.75)))
#make sliding chunk size with res
chunk <- dplyr::case_when(hex_res == 4 ~ 2e4,
hex_res == 5 ~ 5e4,
hex_res == 6 ~ 1e5,
TRUE ~ 1e4)
n <- length(hex_ids)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
hex_ids <- split(hex_ids, r)
hex_sf <-
furrr::future_map_dfr(hex_ids, function(x) {
h3::h3_to_geo_boundary_sf(unlist(x))
},
.options = furrr::furrr_options(seed = NULL)) %>%
sf::st_as_sf(data.table::rbindlist(.)) %>%
sf::st_wrap_dateline(c(
"WRAPDATELINE=YES",
glue::glue("DATELINEOFFSET={dl_offset}")
)) %>%
dplyr::rename(hex_ids = h3_index)
future::plan(strategy = "sequential")
return(hex_sf)
}
h3_grids <- make_h3_grids_parallel(hex_res = 5)
#Grid was made separately to save time and reuse; on later runs, skip the line above and load the stashed version:
h3_grids <- readRDS("data/processed/h3_res_1to5_globalgrid.rds")[[5]]
#join grid and indicators----
tictoc::tic()
grid <- inner_join(h3_grids,
bio_idx,
by = c("hex_ids" = "cell"))
tictoc::toc() #took 16 seconds
I used a custom function based on something Matt Biddle wrote, and modified it with some elements from some of John Waller’s plots so I could play with a few key parameters.
#Plot function, based off one written by Matt Biddle. I extracted out some part he put into the function, so it would be easier to change them
#partly based on John Waller's plots: https://github.com/jhnwllr/es50
gmap_discrete <- function(grid,
column = "shannon",
label = "Shannon index",
trans = "identity",
crs = "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs",
limits = c(0, 8)) {
#bounding box for Contiguous US (CONUS) + buffer
# West_Bounding_Coordinate <- -124.211606 - 0.5
# East_Bounding_Coordinate <- -67.158958 - 0.5
# North_Bounding_Coordinate <- 49.384359 - 0.5
# South_Bounding_Coordinate <- 25.837377 - 0.5
#
# bb_roi <- paste('POLYGON ((',
# West_Bounding_Coordinate,South_Bounding_Coordinate, ",",
# East_Bounding_Coordinate, South_Bounding_Coordinate, ",",
# East_Bounding_Coordinate, North_Bounding_Coordinate, ",",
# West_Bounding_Coordinate, North_Bounding_Coordinate, ",",
# West_Bounding_Coordinate, South_Bounding_Coordinate, '))')
bb_roi <- 'POLYGON ((-180 -90, 180 -90, 180 90, -180 90, -180 -90))'
sfc <- st_as_sfc(bb_roi, crs = '+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0')
bb <- sf::st_bbox(st_transform(sfc, crs))
breaks = seq(0, max(limits), max(limits)/5)
grid[[column]] <- cut(grid[[column]], breaks = breaks)
#grid[[column]] <- as.factor(grid[[column]])
#grid[[column]] <- addNA(grid[[column]])
ggplot() +
geom_sf(
data = grid,
aes_string(fill = column, geometry = "geometry"),
show.legend = TRUE,
#lwd = 0.01
color = NA # https://github.com/tidyverse/ggplot2/issues/3744
) +
scale_fill_brewer(palette = "Spectral", direction=-1, na.value = 'skyblue1') +
geom_sf(
data = world,
fill = NA,
color = "grey",
lwd = 0.1
) +
# geom_sf(
# data = USA_shape,
# fill = NA,
# color = "#000000",
# lwd = 0.1
# ) +
coord_sf(crs = crs,
xlim = bb[c("xmin", "xmax")],
ylim = bb[c("ymin", "ymax")]) +
theme(
panel.background = element_rect(
fill = "white",
colour = "gray95",
size = 0,
linetype = "blank"
),
axis.ticks = element_blank(),
#axis.title.x = element_blank(),
#axis.title.y = element_blank(),
#axis.text = element_blank(),
#panel.grid = element_blank(),
plot.margin = margin(0, 0, 0, 0, "cm")
)
}
Make the map. Generating and/or saving the map can take about 2-3 minutes if you increase the resolution.
# map defaults
RES <- 5 # defined at https://h3geo.org/docs/core-library/restable/
column <- "es"
label <- paste0("ES", es_depth) #legend label
trans <- "identity"
crs <-
"+proj=robin +lat_ts=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs"
limits <- c(0, 100) #legend limits
#states
USA_shape <- rnaturalearth::ne_states(returnclass = "sf", country = "United States of America")
world <- rnaturalearth::ne_countries(scale = "small",
returnclass = "sf")
map <- gmap_discrete(
grid,
column,
label = label,
trans = trans,
crs = crs,
limits = limits
)
tictoc::tic()
ggsave(filename = paste0("output/tests/HPC/", label, "_global_2010s_res5_OBIS+GBIF_", kings, "_", Sys.Date(), ".png"),
plot = map,
width = 6400,
height = 3200,
units = "px",
scale = 1,
limitsize = FALSE)
tictoc::toc() #took 2.5 min
total_runtime <- tictoc::toc()
Note: I wasn’t happy with how these were rendering in the notebook, so I quickly cropped some whitespace off the sides via Windows.
| ES100 - All kingdoms, 2010s |
|---|
| ES50 - All kingdoms, 2010s |
|---|
| ES50 - Animal kingdom only, 2010s |
|---|
| ES50 - Plant kingdom only, 2010s |
|---|